1 Overview

We are strong supporters of open-source tools for reproducible research. Priority index or Pi is developed as a genomic-led target prioritisation system, with the focus on leveraging genetic data to prioritise targets at the gene and pathway level. Based on evidence of genetic association from GWAS data, this prioritisation system is able to resolve modulated genes (seed genes) by utilising knowledge of linkage disequilibrium (co-inherited variants), distance from the gene, and evidence of genetic association with gene expression. Seed genes are scored in an integrative way quantifying the genetic influence. Scored seed genes are subsequently used as baits to rank seed genes plus additional (non-seed) genes; this is achieved by iteratively exploring the global connectivity of a gene interaction network. Genes with the top priority are further used to identify/prioritise pathways that are significantly enriched with highly prioritised genes. Prioritised genes are also used to identify a gene network interconnecting many highly prioritised genes but a few lowly prioritised genes as linkers. See below for the workflow.

2 Installation

This intends for end-users who are comfortable with R (with the latest version available at http://cran.r-project.org). Pi together with dependant packages (such as XGR) can be installed following instructions below:

# install the dependant packages including `XGR` and others
source("http://bioconductor.org/biocLite.R")
biocLite(c("XGR","devtools"), siteRepos=c("http://cran.r-project.org"))
library(devtools)
install_github(c("hfang-bristol/XGR", "hfang-bristol/dnet"), dependencies=TRUE)

# also, install the `Pi` package itself
library(devtools)
install_github(c("hfang-bristol/Pi"), dependencies=TRUE)

3 R functions

Priority functions are designed in a nested way. The core relation follows this route: xPierSNPs -> xPierGenes -> xPier -> xRWR, achieving gene-level prioritisation from an input list of SNPs (along with their significant level). The output of this route is taken as the input of either xPierManhattan for visualising gene priority scores, xPierPathways for prioritising pathways, or xPierSubnet for identifying a network of top prioritised genes.

3.1 xRWR

xRWR: implements Random Walk with Restart (RWR) estimating the affinity score of nodes in a graph to a list of seed nodes. The affinity score can be viewed as the influential impact over the graph that is collectively imposed by seed nodes. If the seeds are not given, it will pre-compute affinity matrix for nodes in the input graph with respect to each starting node (as a seed) by looping over every node in the graph. A higher-level function xPier directly relies on it.

3.2 xPier

xPier: uses RWR to calculate the affinity score of nodes in a graph to a list of seed nodes. A node that has a higher affinity score to seed nodes will receive a higher priority score. It is an internal function acting as a general template for RWR-based prioritisation. A higher-level function xPierGenes directly relies on it.

3.3 xPierGenes

xPierGenes: prioritises gene targets from an input gene interaction network and the score info imposed on its seed nodes/genes. This function can be considered to be a specific instance of xPier, that is, specifying a gene interaction network as a graph and seed nodes as seed genes.

There are two sources of gene interaction network information: the STRING database [@Szklarczyk2015] and the Pathways Commons database [@Cerami2011]. STRING is a meta-integration of undirect interactions from a functional aspect, while Pathways Commons mainly contains both undirect and direct interactions from a physical/pathway aspect. In addition to interaction type, users can choose the interactions of varying quality:

Code Interaction (type and quality) Database
STRING_high Functional interactions (with high confidence scores>=700) STRING
STRING_medium Functional interactions (with medium confidence scores>=400) STRING
PCommonsUN_high Physical/undirect interactions (with references & >=2 sources) Pathways Commons
PCommonsUN_medium Physical/undirect interactions (with references & >=1 sources) Pathways Commons
PCommonsDN_high Pathway/direct interactions (with references & >=2 sources) Pathways Commons
PCommonsDN_medium Pathway/direct interactions (with references & >=1 sources) Pathways Commons

3.4 xPierSNPs

xPierSNPs: prioritises gene targets from an input gene interaction network and a given list of SNPs together with the significance level (eg GWAS reported p-values). To do so, it first defines seed genes and their scores that are calculated in an integrative manner to quantify the genetic influence under SNPs. Genetic influential score on a seed gene is calculated from the SNP score (reflecting the SNP significant level), the gene-to-SNP distance weight and the eQTL mapping weight. This function can be considered to be a specific instance of xPierGenes, that is, specifying seed genes plus their scores.

Knowledge of co-inherited variants is also used to include additional SNPs that are in Linkage Disequilibrium (LD) with GWAS lead SNPs. LD SNPs are calculated based on 1000 Genomes Project data [@Project2012]. LD SNPs are defined to be any SNPs having R2 > 0.8 with GWAS lead SNPs. The user can choose the population used to calculate LD SNPs:

Code Population Project
AFR African 1000 Genomes Project
AMR Admixed American 1000 Genomes Project
EAS East Asian 1000 Genomes Project
EUR European 1000 Genomes Project
SAS South Asian 1000 Genomes Project

3.5 xPierManhattan

xPierManhattan: visualises prioritised genes using manhattan plot, in which priority for genes is displayed on the Y-axis along with genomic locations on the X-axis. Also highlighted are genes with the top priority.

3.6 xPierPathways

xPierPathways: prioritises pathways based on enrichment analysis of genes with the top priority (eg top 100 genes) using a compendium of pathways. A highly prioritised pathway has its member genes with high gene-level priority scores, that is, having evidence of direct modulation by disease-associated lead (or LD) SNPs, and/or having evidence of indirect modulation at the network level.

In addition to pathways, enrichment analysis can be extended to other ontologies, as listed below:

Code Ontology Category
DO Disease Ontology Disease
GOMF Gene Ontology Molecular Function Function
GOBP Gene Ontology Biological Process Function
GOCC Gene Ontology Cellular Component Function
HPPA Human Phenotype Phenotypic Abnormality Phenotype
HPMI Human Phenotype Mode of Inheritance Phenotype
HPCM Human Phenotype Clinical Modifier Phenotype
HPMA Human Phenotype Mortality Aging Phenotype
MP Mammalian/Mouse Phenotype Phenotype
DGIdb DGI druggable gene categories Druggable
SF SCOP domain superfamilies Domain
PS2 phylostratific age information (our ancestors) Evolution
MsigdbH Hallmark gene sets Hallmark (MsigDB)
MsigdbC1 Chromosome and cytogenetic band positional gene sets Cytogenetics (MsigDB)
MsigdbC2CGP Chemical and genetic perturbation gene sets Perturbation (MsigDB)
MsigdbC2CPall All pathway gene sets Pathways (MsigDB)
MsigdbC2CP Canonical pathway gene sets Pathways (MsigDB)
MsigdbC2KEGG KEGG pathway gene sets Pathways (MsigDB)
MsigdbC2REACTOME Reactome pathway gene sets Pathways (MsigDB)
MsigdbC2BIOCARTA BioCarta pathway gene sets Pathways (MsigDB)
MsigdbC3TFT Transcription factor target gene sets TF targets (MsigDB)
MsigdbC3MIR microRNA target gene sets microRNA targets (MsigDB)
MsigdbC4CGN Cancer gene neighborhood gene sets Cancer (MsigDB)
MsigdbC4CM Cancer module gene sets Cancer (MsigDB)
MsigdbC5BP GO biological process gene sets Function (MsigDB)
MsigdbC5MF GO molecular function gene sets Function (MsigDB)
MsigdbC5CC GO cellular component gene sets Function (MsigDB)
MsigdbC6 Oncogenic signature gene sets Oncology (MsigDB)
MsigdbC7 Immunologic signature gene sets Immunology (MsigDB)

3.7 xPierSubnet

xPierSubnet: identifies a gene network that contains as many highly prioritised genes as possible but a few lowly prioritised genes as linkers. An iterative procedure of choosing different priority cutoff is also used to identify the network with a desired number of nodes/genes.

4 Showcases

In this section, we use GWAS SNPs about an inflammatory disease Spondyloarthritis (including Ankylosing Spondylitis and Psoriatic Arthritis) as a case study, and give a step-by-step demo showing how to use Pi to achieve disease-specific genetic prioritisation of targets at the gene, pathway and network level.

4.1 Input data

Spondyloarthritis-associated GWAS lead SNPs are collected mainly from GWAS Catalog, complemented by ImmunoBase and latest publications.

data.file <- "http://galahad.well.ox.ac.uk/bigdata/Spondyloarthritis.txt"
data <- read.delim(data.file, header=TRUE, stringsAsFactors=FALSE)

The first 15 rows of the data frame data are shown below, with the column SNP for AS GWAS SNPs and the column Pval for GWAS-detected P-values.

SNP Pval
rs6600247 2.6e-15
rs11209026 2.0e-27
rs11209026 9.1e-14
rs12141575 9.4e-11
rs4129267 3.4e-13
rs1801274 1.4e-09
rs41299637 1.9e-15
rs1250550 1.5e-09
rs1250550 1.5e-09
rs1250550 1.5e-09
rs11190133 4.9e-14
rs1860545 2.8e-10
rs11065898 4.7e-08
rs11624293 1.5e-10
imm_16_28525386 2.6e-09

4.2 Gene-level prioritisation

It includes the following steps:

  1. Define seed genes based on distance-to-SNP window and genetic association with gene expression: nearby genes that are located within defined distance window (by default, 50kb) of lead or Linkage Disequilibrium (LD) SNPs that are calculated based on European population data from 1000 Genome Project; expression associated genes (eQTL genes) for which gene expression is, either in a cis- or trans-acting manner, significantly associated with lead or LD SNPs, based on immune-related eQTL data.

  2. Score seed genes to quantify the genetic influence under lead or LD SNPs.

  3. Prioritise target genes by estimating their global network connectivity to seed genes. With scored seed genes superposed onto a gene interaction network (see above STRING_high), RWR algorithm is implemented to prioritise candidate target genes based on their network connectivity/affinity to seed genes. As such, a gene that has a higher affinity score to seed genes will receive a higher priority score.

Specify parameters

include.eQTL <- c("JKscience_TS2A","JKscience_TS2B","JKscience_TS3A","JKng_bcell","JKng_mono","JKnc_neutro","JK_nk", "GTEx_V4_Whole_Blood")
include.LD <- 'EUR'
LD.r2 <- 0.8
significance.threshold <- 5e-5
distance.max <- 50000
decay.kernel <- "rapid"
decay.exponent <- 2
GR.SNP <- "dbSNP_GWAS"
GR.Gene <- "UCSC_knownGene"
cdf.function <- "empirical"
scoring.scheme <- 'max'
network <- "STRING_high"
restart <- 0.75
RData.location <- "http://galahad.well.ox.ac.uk/bigdata"

Do prioritisation

pNode <- xPierSNPs(data, include.LD=include.LD, LD.r2=LD.r2, significance.threshold=significance.threshold, distance.max=distance.max, decay.kernel=decay.kernel, decay.exponent=decay.exponent, GR.SNP=GR.SNP, GR.Gene=GR.Gene, include.eQTL=include.eQTL, cdf.function=cdf.function, scoring.scheme=scoring.scheme, network=network, restart=restart, RData.location=RData.location)

The results are stored in the data frame pNode$priority, which can be saved into a file Genes_priority.txt:

write.table(pNode$priority, file="Genes_priority.txt", sep="\t", row.names=FALSE)

Visualise in manhattan plot

Top genes can be highlighted in manhattan plot, in which priority scores for genes are displayed on the Y-axis along with genomic locations on the X-axis.

mp_AS <- xPierManhattan(pNode, highlight.top=20, cex=0.5, highlight.col="red", highlight.label.size=3, RData.location=RData.location)
print(mp_AS)

4.3 Pathway-level prioritisation

Pathway-level prioritisation is based on top 100 genes using a compendium of pathways from diverse sources (Canonical, KEGG, BioCarta and Reactome). Since diverse sources are used, it is necessary to remove redundant pathways (of the same granularity to the similar pathways with higher priority scores).

eTerm <- xPierPathways(pNode, priority.top=100, ontology="MsigdbC2CPall", RData.location=RData.location)
eTerm_nonred <- xEnrichConciser(eTerm)

# view the top pathways/terms
xEnrichViewer(eTerm_nonred)

# save results to a file `Pathways_priority.nonredundant.txt`
Pathways_nonred <- xEnrichViewer(eTerm_nonred, top_num=length(eTerm_nonred$adjp), sortBy="fdr", details=TRUE)
output <- data.frame(term=rownames(Pathways_nonred), Pathways_nonred)
write.table(output, file="Pathways_priority.nonredundant.txt", sep="\t", row.names=FALSE)

Barplot of prioritised pathways (non-redundant and informative):

bp <- xEnrichBarplot(eTerm_nonred, top_num="auto", displayBy="fdr", FDR.cutoff=1e-3, wrap.width=100)
bp

4.4 Network-level prioritisation

Network-level prioritisation is to identify a gene network that contains as many highly prioritised genes as possible but a few lowly prioritised genes as linkers. Given a gene network (the same as used in gene-level prioritisation) with nodes labelled with gene priority scores, the search for a maximum-scoring gene subnetwork is briefed as follows:

  1. score transformation, that is, given a threshold of tolerable priority scores, nodes above this threshold (nodes of interest) are scored positively, and negative scores for nodes below the threshold (intolerable).

  2. subnetwork identification, that is, to find an interconnected gene subnetwork enriched with positive-score nodes but allowing for a few negative-score nodes as linkers, done via heuristically solving prize-collecting Steiner tree problem.

  3. controlling the subnetwork size, that is, an iterative procedure of scanning different priority thresholds is used to identify the network with a desired number of nodes/genes.

Notably, the preferential use of the same network as used in gene-level prioritisation is due to the fact that gene-level affinity/priority scores are smoothly distributed over the network after being walked. In other words, the chance of identifying such a gene network enriched with top prioritised genes is much higher. To reduce the runtime, by default only top 10% prioritised genes will be used to search for the maximum-scoring gene network.

# find maximum-scoring gene network with the desired node number=50
g <- xPierSubnet(pNode, priority.quantite=0.1, subnet.size=50, RData.location=RData.location)

The identified gene network has nodes/genes colored according to their priority scores (see below). Notably, if nodes appear abnormally, please remove vertex.shape="sphere" when running the function xVisNet.

pattern <- as.numeric(V(g)$priority)
zmax <- ceiling(quantile(pattern,0.75)*1000)/1000
xVisNet(g, pattern=pattern, vertex.shape="sphere", colormap="yr", newpage=FALSE, edge.arrow.size=0.3, vertex.label.color="blue", vertex.label.dist=0.35, vertex.label.font=2, zlim=c(0,zmax))

5 Session Info

Here is the output of sessionInfo() on the system on which this user manual was built:

> R version 3.2.4 (2016-03-10)
> Platform: x86_64-apple-darwin13.4.0 (64-bit)
> Running under: OS X 10.11.6 (El Capitan)
> 
> locale:
> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
> 
> attached base packages:
> [1] grid      stats     graphics  grDevices utils     datasets  methods  
> [8] base     
> 
> other attached packages:
> [1] png_0.1-7       BiocStyle_1.6.0
> 
> loaded via a namespace (and not attached):
>  [1] magrittr_1.5    formatR_1.3     tools_3.2.4     htmltools_0.3  
>  [5] yaml_2.1.13     stringi_1.1.1   rmarkdown_0.9.5 highr_0.5.1    
>  [9] knitr_1.12.3    stringr_1.1.0   digest_0.6.10   evaluate_0.8.3